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We calculate on GPUs the disconnected diagrams associated with the nucleon form factors and 
moments of generalized parton distributions using Nf=2+1+1 twisted mass fermions. We employ 
the truncated solver method (TSM) for estimating the all-to-all propagators. Due to the fact that 
the TSM involves many low precision stochastic estimators, the usage of GPUs is essential to 
perform efficiently the contractions and the inversions. 
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1. Introduction 

The evaluation of disconnected diagrams is of paramount importance for eliminating system- 
atic errors in the determination of proton and neutron observables. These contribute significantly 
in the evaluation of the v\' mass and strange content of the nucleon, and require a non-perturbative 
evaluation involving all-to-all propagators at a given time slice. This, and the inherent gauge noise 
associated with fermionic loops, explains why most hadron studies neglect these contributions. 

Fortunately, in the recent years there has been progress in algorithms and an increase in com- 
putational power, making these computations feasible. On the algorithmic side, the introduction 
of improvements as the one-end trick, and the truncated solver method (TSM) had led to a sig- 
nificant reduction in the variance of disconnected computations. Using the properties of twisted 
mass fermions, one can further reduce the variance in isoscalar quantities by taking appropriate 
combinations of two flavors of twisted mass fermions. On the hardware side, GPU units provide a 
large speed-up in the evaluation of quark propagators and contractions. For the TSM, they provide 
an optimal platform for swiftly increasing the amount of measurement we can perform. 



2. Methods for disconnected calculations 



2.1 Stochastic estimation 



The exact computation of all-to-all propagators for the lattice volumes of physical interest is 
outside the current computer power. The fermionic matrix size ranges from ~ 10 6 to ~ 10 9 in 
the largest volumes, thus an exact computation of the inverse would require an equal number of 
inversions, and the situation for timeslice-to-all propagators is equally unfeasible. A way to make 
progress is to compute an unbiased stochastic estimation of the propagator [1]: we generate a set 
of N sources \r]j) randomly, by filling each component of the source with a number, in our case a 
particular representation of the Z2 or Z4 group. Then the sources have the following properties: 



The first property ensures that our estimation is unbiased. The second one allows us to reconstruct 
the inverse matrix by solving for \sj) in M \sj) = |t7 7 ) and calculating 

M E '-=ljt\si){T]j\^M- 1 . (2.2) 
iV 7=1 

This way the computation becomes feasible, although it is still expensive due to the high number 
of inversions required to achieve a good estimate of M~ l in Eq. (2.2). 
The deviation of the estimator from the exact solution is given by 

M- l -M E l =M- l x fl-^f>)(r?|Y (2-3) 

From Eq. (2.1) it is clear that the more stochastic sources are used, the smaller the stochastic error 
becomes. In fact, from Eq.(2. 1) and (2.3) we learn that the errors decrease as O ( -A= J , as expected. 
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Since we also have to deal with the gauge error, we would like to minimize the statistical error by 
increasing the number of stochastic sources N until we reach the gauge noise. For some cases, this 
may result in a large value of N, an expensive choice. 

2.2 The Truncated Solver Method 

The Truncated Solver Method (TSM) [2] increases N at a reduced cost by aiming at a low 
precision (LP) estimation of the inverse \sj) LP = (M _1 ) Lp TJ/), where the inverter is truncated at 
reduced accuracy. The truncation criterium can be a large residual or equivalently a fixed number of 
iterations. This way we can increase the number of sources Nlp cheaply, but we are introducing a 
bias in our estimate due to the truncation. We correct the bias stochastically, by inverting a number 
of sources to high and low precision and taking the difference: 

1 N H p 1 Nhp+N L p 

M W=^I [l^Hp-l^LpliVjl+jfo I \ S j) L p(^l ( 2 - 4 ) 

V * ' V ' 

Correction Biased estimate 

which requires Nhp high precision inversions and Nhp + Nlp low precision inversions. If the con- 
vergence of the solver is fast, we only need a few high precision inversions to estimate properly the 
correction, and then the error falls as O 1 /Nlpj ■ Therefore we want to ensure a good conver- 
gence for the solver; in our case this is ensured by the twisted mass regularization, which introduces 
a lower bound for the eigenvalues of the dirac operator. 

The TSM needs tuning of its parameters, namely the precision of the LP inversions and 
Nhp /Nlp ratio, to get a safe result with maximum performance. For the first parameter we chose 
values already used in the literature, i.e., the residual Plp ~ 10~ 2 [3]. The tuning of the second 
parameter was performed empirically: we took a disconnected diagram we expected to yield a 
large stochastic error, and we optimized Nhp and Nlp so as to get the minimum error at the lowest 
computer cost. As shown in Fig. 1, the error decreases as the number of HP or LP increases. A 
good compromise for this particular diagram is HP =12 and LP = 300 as the cheapest point that 
saturates to the gauge noise. Since the tuning depends on the diagram to be computed, we decided 
to take the more conservative number of 24 for the number of HP sources. 

2.3 The one-end trick 

The properties of the twisted mass action provide a powerful method to reduce the variance of 
the disconnected diagrams. The standard way to compute the disconnected diagrams is to generate 
N stochastic sources rj r , invert them, and compute the diagrams corresponding to operator X as 
h T!r ( r lr^ s r) ~ Tr (M _1 X) , where the operator X is expressed in the twisted basis. However, if 
the operator X involves an isovector combination in the twisted basis, one can resort to the identity 
M u —M d = 2iLiay 5 , which becomes M" 1 —M d l = —2i\iaM~^ Y*,M~ l for the propagators: 

^ t (srYsXs,.) = Tr (M- l X) - Tr (M d l X) + O Q=) . (2.5) 

As a result of this substitution, the fluctuations are reduced by the small Li factor. Most important 
is the implicit sum of V terms in the product M^y^M^ 1 . The difference of propagators exhibits 
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a signal-to-noise ratio of l/y/V, but in the product it becomes V/W^. In fact, a comparison 
between the two methods reveals a large reduction in the errors at the same computer cost [4-6]. 
The drawback of this technique is its inapplicability to operators lacking a T3 flavour matrix in the 
twisted basis. A generalized version of the trick can be developed from the identity M u +Mj = 
2Dw, with Dw the Dirac -Wilson operator without a twisted mass term. After some algebra, 

~ £ {srYsXjsDws,-) = Tr (M^x) + Tr (M d l X) + O (^Jj , (2.6) 
but the lack of the /I suppresion factor introduces a considerable penalty in the signal-to-noise ratio. 



3. Simulation details 

In order to test these methods, we analyzed 4698 configurations of the 555 ensemble of the 
ETMC collaboration. This ensemble is a 32 3 x 64 lattice and was generated with 2 + 1 + 1 dy- 
namical fermions, at pion mass m K 360 MeV and strange and charm quark masses fixed at about 
their physical values. The resulting lattice spacing is a = 0.086(1) fm determined from the nucleon 
mass resulting in m n L ~ 5. The disconnected diagrams were computed by making intensive use of 
a modified version of the QUDA library [7, 8], which implemented new code and kernels to do the 
required inversions and contractions on the GPUs. For the Fourier transform we used the CUFFT 
library. 

The QUDA library allowed for multi-GPU calculations, so 2 GPUs worked in parallel per 
configuration. As seen in the right graph of Fig. 1 , the scaling for a few GPUs is very good, with 
a ~ 90% increase in performace when adding the second GPU. This result holds up to 8 GPUs, 
where there is a drop, beyond that the advantages of adding new GPUs are only useful in the case 
of lack of memory. It is remarkable however that we can reach TFlop sustained performance with 
just a few GPUs. 
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Figure 1: Left: Tuning of the number of HP and LP stochastic noise vectors for the TSM using 50 configu- 
rations of the B55.32 ensemble for the the traceless version of the operator i^y^D^ at a given value of the 
insertion time f, = 8 and sink time t s = 16. The error is shown versus N^p for different values of Nhp marked 
by the different ploting symbols given in the legend. Right: Strong scaling of the multi-GPU code for this 
ensemble. 
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The computations were performed on GPU clusters with NVidia fermi GPUs, mainly Tesla 
M2070 with 6Gb of memory, but also Tesla M2090 and M2050. The noise sources were generated 
on-the-fly, and the propagators were not stored, in order to save storage and I/O time. 

4. The analysis with the summation method 

One of the advantages of the one-end trick for twisted mass fermions is the fact that, since the 
noise sources must be on all sites, we obtained results for all the possible insertions for free. This 
feature enables us to use the summation method to perform the ratio analysis. 

The method is known since a long time [9-11], and requires the knowledge of the three point 
function for all possible insertion times. The advantage is the reduction of the noise due to the 
excited states by an exponential of the sink time, e~ kts , as opposed to the standard decrease with the 
insertion time e~ Kt '. In this method we sum, for every value of t s , the ratios from = up to tj = t s , 
Rsum{t s ) = Lf =o Rpiateau(ti,t s ) ■ Thence the dependence of the ratio on ti dissappears. The ratio 
Rpiateau, computed as the quotient between the three -point function and the two-point function, can 
be written as Rpiatem<(ti,t s ) = Rgs + 0{e~ Kti ) + 0{e~ K ' ts ), where Res is the uncontaminated ratio, 
and the other contributions are the undesired excited states. After performing the sum in tj, we get 
the ratio as a slope Rs um (t s ) = t s R G s + c(K,K') + 0(e~ Kts ) + 0(e~ K '' 5 ), and the contributions of the 
excited states become a geometrical series in tj whose sum decays as t s . Therefore we expect a 
better suppression of the excited states at the same t s . The drawback is that we now need to fit to a 
straight line with two fitting parameters instead of one. 

5. Results 




Figure 2: Left: Charm content for the nucleon, from Rpi a teau(ti,ts)- The grey band is the value obtained 
from the summation method (right). 

We combined the GPU-computed diagrams with nucleon 2-point functions in order to get the 
ratios for g& and . Each disconnected diagram was combined with a set of 5 2-point functions, 
with randomized positions for each one of the 2912 configurations, where the 2-point functions 
were computed for proton and neutron, propagating backwards and forwards. In this manner we 
produced 20 measurements per gauge configuration. The slope obtained in the summation method 
changes as the sink-source separation increases and fitting too early would yield a wrong result. 
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Figure 3: Left: Disconnected contribution to the light c-term of the A from Rpi a teau{ti,tf). Right: The 
strange (J-term of the A. 




Figure 4: Left: Light c-term of the Q. from Rpiateau^i^s)- The grey band is the value obtained from the 
summation method by fitting the slope shown on the right. 




Figure 5: Disconnected contribution to the isoscalar gA using the generalized one-end trick of Eq.(2.6). The 
results are noisier than those obtained for operators calculated using the standard one-end trick of Eq.(2.5). 



The two methods give consistent results, and therefore combining both one can ensure that we 
have a large enough sink-source separation for excited states to be neglected. 

For the (Fig.2), more statistics are needed to understand the change in slope in the summa- 
tion method. In the summation of the A we observed a similar behaviour, but it was quite reduced 
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and the results agree with the plateaus (Fig. 3), even when our statistics were smaller: 4643 con- 
figurations combined with 4 different 2-point functions propagating forwards and backwards (8 
measurements). In contrast, the Q. (Fig. 4) yields a strong signal with the same statistics. 

The generalized version of the one-end trick as expected is more noisy. Our results for the 
isoscalar nucleon axial charge, g'£ are shown in Fig. 5 and are in agreement with recent evaluation 
using Clover fermions [12]. 

6. Conclusions 

The computation of disconnected contributions for flavour singlet quantities has become fea- 
sible, due to the improvement in the algorithms and to the increase in computational resources. In 
this work we show that we can get reliable results for disconnected contributions to the a-terms 
and the isoscalar axial charge. GPUs are particularly efficient for the evaluation of disconnected 
diagrams using the TSM, yielding a huge improvement in the computation of LP inversions and 
contractions. In additon, the one-end trick allows a reduction of the variance at the same com- 
putational cost, as well as getting the fermion loops for all the possible insertion times for free. 
This property, together with the application of the plateau and the summation methods, as well 
as the generalized one-end trick, allowed us to compute nucleon observables wiere disconnected 
diagrams play an important role. 
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